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Abstract 

N-body codes for performing simulations of the origin and evolution of the large scale 
structure of the universe have improved significantly over the past decade in terms of both 
the resolution achieved and the reduction of the CPU time. However, state-of-the-art N-body 
codes hardly allow one to deal with particle numbers larger than a few 10 7 , even on the largest 
parallel systems. In order to allow simulations with larger resolution, we have first reconsidered 
the grouping strategy as described in J. Barnes (1990, J. Comput. Phys. 87, 161) (hereafter 
B90) and applied it with some modifications to our WDSH-PT (Work and Data SHaring - 
Parallel Tree) code (U. Becciani et al., 1996, Comput. Phys. Comm. 99,1). In the first part of 
this paper we will give a short description of the code adopting the algorithm of J. E. Barnes 
and P. Hut (1986, Nature, 324, 446) and in particular the memory and work distribution 
strategy applied to describe the data distribution on a CC-NUMA machine like the CRAY- 
T3E system. In very large simulations (typically N > 10 7 ), due to network contention and 
the formation of clusters of galaxies, an uneven load easily verifies. To remedy this, we have 
devised an automatic work redistribution mechanism which provided a good dynamic load 
balance without adding significant overhead. In the second part of the paper we describe the 
modification to the Barnes grouping strategy we have devised to improve the performance of 
the WDSH-PT code. We will use the property that nearby particles have similar interaction 
lists. This idea has been checked in B90, where an interaction list is built which applies 
everywhere within a cell C gr0 up containing a small number of particles N C rit- B90 reuses this 
interaction list for each particle p £ C gr0 up in the cell in turn. We will assume each particle 
p to have the same interaction list. We consider that the agent force F p on a particle p 
can be decomposed into two terms F p = F/ ar + F„ ear . The first term F/ ar is the same for 
each particle in the cell and is generated by the interaction between a hypothetical particle 
placed in the center of mass of the C gr0 up and the farther cells contained in the interaction 
list. F near is different for each particle p and is generated by the interaction between p and 
the elements near C gr0 up. Thus it has been possible to reduce the CPU time and increase 
the code performance. This enable us to run simulations with a large number of particles 
(N ~ 10 7 -f- 10 9 ) in nonprohibitive CPU times. 



1 INTRODUCTION 

N-body codes are one of the most important tools of theoretical cosmology |J because they offer 
the possibility of simulating most of the gravitational processes driving the formation of the large 
scale structure of the universe (hereafter LSS) || jl5| |ll[] . These simulations are often used to check 
cosmological models and to constrain the free parameters of these models which cannot be fixed 
either theoretically or observationally. 
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The typical mass scale for gravitational instability, the Jeans mass, has a value of rj 10 6 ' 5 solar 
masses (1 solar mass ~ 1.9 x 10 33 <7) at the recombination epoch, and it gives the approximate 
size of the first objects forming by gravitational collapse at that epoch. On the other hand, the 
largest structure we see in our Universe today, the "Supercluster" of galaxies, has a mass in excess 
of rs 10 18 solar masses. Moreover, the gravitational force has a truly long-range character, which 
makes it impossible to introduce reasonable upper cutoffs in the mass range. For all these reasons, 
one would like to be able to perform simulations spanning more than 12 orders of magnitude in 
mass, but present-day state-of-the-art software and hardware technology does not allow simulations 
with more than rj 10 9 bodies. For these reasons, the quest for increasingly efficient algorithms 
is still in progress. However, the importance of making N-body simulations is clear to several 



authors [p.2|]19||20 . During the past years N-body codes have been much improved and applied 
successfully to various problems in galaxy dynamics, galaxy formation, and cosmological large 
structure formation. Nevertheless, the computational expense has remained prohibitive for N > 
10 9 , even using tree-based algorithms on the most powerful computers. 

The situation is even worse for other N-body algorithms. The N-body direct evolution method 
scales as 0(N 2 ), which makes it impossible to run simulations with more than 10 4 particles. To 
overcome this difficulty, and when high accuracy is required, alternative numerical methods based 
on hierarchical force-computation algorithms are widely used. The recent effort has addressed the 
production of new software and algorithms for the new generation of high-performance computer 
systems. The ultimate target is an implementation of the tree N-body algorithm to run simulations 
with higher accuracy and particle number, decreasing the cost of the simulation in terms of CPU 
time and increasing performance in terms of number of particles/second elaborated when running 
on MPP systems. 

Among the tree algorithms designed to compute the gravitational force in N-body systems, one of 
the most used and powerful in modern cosmology is that by Barnes and Hut (BH) [EJ. The BH 
octal-tree recursive method is inherently adaptive and allows one to achieve a higher mass resolution 
even if parallel implementation of this algorithm R|[p3| suffers from a serious drawback: it can 
easily run into imbalance as soon as the configuration evolves, causing performance degradation. 
In this paper we present a modified version of the BH algorithm in which we have introduced an 
enhanced grouping strategy. We will show how this feature allows an increase in performance when 
we consider N-body simulation with a large number of particles (N > 10 6 ). The code we present 
incorporates fully periodic boundary conditions using the Ewald method, without the use of fast 
Fourier transform techniques |l6| . 

In Section 2 we give a brief description of our N-body parallel code, based on the BH tree algorithm, 
and the dynamic load balance (DLB) policy adopted. In Section 3 we describe our enhanced 
grouping strategy. In Section 4 we show the results of our tests and in Section 5 we report our 
conclusions. 



2 THE PARALLEL CODE 

Since the publication of the monograph by Hockney and Eastwood in 1981 p7| , a new class of 
particle simulation methods @| f| fig}] |plf |22| has emerged as an alternative to particle-particle 
(PP) Dllil particle-mesh (PM) (for a review of this method see and particle-particle- 
particle-mesh (P 3 M ) |ll| methods. These new methods are characterized by the particles being 
arranged into a hierarchy of clusters, which span the full range of length scales from the minimum 
interparticle spacing up to the diameter of the entire system. These methods are usually known as 
tree methods or tree codes because of the data structures used. With these new methods, the short- 
range force on a particle p is calculated as a direct sum over nearby particles. Remote bodies are 
organized into groups which become progressively larger with the distance from the particle; then 
a multipole expansion of the potential of each cluster about its center of mass is performed. The 
long-range contribution to the acceleration is given by the sum of the particle-cluster interactions. 
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Figure 1: Interaction list formation in BH code. 



2.1 The Barnes-Hut Tree Algorithm 

The BH algorithm works using a hierarchy of cubes arranged in an octal-tree structure; that is, 
each node in the tree has eight siblings and each node represents a physical volume of the space. 
The total mass of all particles within a given volume and their centers of mass are stored at the 
corresponding node. Thus, the system is first surrounded by a single cell (cube) encompassing all 
the particles. This main cell (called root) is subdivided into eight subcells of equal volume, each 
containing its own subset of particles. Each subcell in turn is subdivided into eight new subcells 
and so on. This procedure is repeated until each cell at the lowest level contains only one particle. 
. The force on any given p is then the sum of the forces by the nearby particles plus the force by 
the distant cells whose mass distributions are approximated by multipole series truncated typically 
at the quadrupole order p"5| . The criterion for determining whether a cell is sufficiently distant 
for a multipole force evaluation (that is, for approximating the cell as a multipole) is based on an 
opening angle parameter 9 given by 

f<», a) 

where C/ is the size of the cell and d is the distance of p from the center of mass of the cell. Smaller 
values of 9 lead to more cell opening and more accurate forces (for 9 = 1 we have an error lower 
than 1% on the accelerations EBI). The equations of the dynamics are solved using the Leapfrog 
integrator. 



2.2 Data Distribution and DLB 

In our parallel implementation of the BH tree algorithm, using the PGHPF/CRAFT (an imple- 
mentation of High Performance Fortran by the Portland Group) programming environment for the 
Cray T3E system, we have exploited both the Data Sharing and the Work Sharing programming 
models. The flexibility of the PGHPF/CRAFT environment allows one to mix these two modes 
in order to gain the maximum efficiency and speed-up. We can distinguish two main phases in 
our code structure: the Tree_formation (TF) and the Force_compute (FC). A data distribution in 
contiguous blocks 

!HPF$ DISTRIBUTE PARTICLE_ATTRIBUTE(BLOCK,*) 
and alternatively, a fine grain distribution 

!HPF$ DISTRIBUTE TREE_ATTRIBUTE( CYCLIC , * ) 
were adopted to distribute the particle data properties and the tree data properties. 



3 



The !HPF$ DISTRIBUTE directive of the PGHPF/CRAFT compiler allows us to consider an 
array like PARTICLE_ATTRIBUTE (or TREE_ATTRIBUTE) as a unique large array, accessible 
from all the processors, the array being physically distributed in the local memory of all the pro- 
cessors. We used two different sets of initial conditions, namely uniform and clustered distributions 
having 2 million particles each, and they were carried out using from 16 to 128 PEs. Our results 
show that the higher code performances are obtained using a fine grain tree data distribution and 
a coarse grain bodies data distribution. A detailed description can be found in jf| . The static array 
distribution, fixed as described above, allows each PE to cooperate during the TF phase by using 
principally the DO INDEPENDENT structure of PGHPF that is a synchronous mechanism, and 
then to execute the FC phase in asynchronous mode. To minimize the communication overhead, 
each PE executes the FC phase mainly on the local residing bodies. The BLOCK distribution 
arranges bodies with the nearest logical number (near in the space) in the same PE local memory, 
or in the nearest PEs. Using the above mentioned data distribution, each PE has a block of closed 
bodies in the local memory (Np = Nf 10 d/N$PEs, where N$PEs is the number of processors used 
for the simulation) ;in an initial condition with a uniform distribution, the PEs having extreme 
numeration in the pool of available PEs have a lower load at each time-step. The load imbalance is 
enhanced when a clustered situation occurs during the system evolution. The PEs having bodies 
in clustered regions have a greater workload since the load of the FC phase increases as the mass 
density grows. The technique we follow to perform a load redistribution among the PEs is to assign 
each PE to execute this phase only for a fixed portion of the bodies residing in the local memory 
NBi p given by 

NB lp = {N bod /N$PEs)P lpi (2) 

where Pi p — const. (0 < Pi p < 1). 

The FC phase for all the remaining bodies 

N f = N$PEs(N bod /N$PEs)(l - P lp ) (3) 

is executed by all the PEs that have concluded the FC phase for the assigned NE>i p bodies. No 
correlation is considered between the PE memory location of the body belonging to the Nf set and 
the PE that computes the FC phase on it. The results imply that it is possible to fix a Pi p value 
that allows the best code performances. Data already presented in || show that it is convenient 
to fix the Pi p value near 0.25, which is the value that maximizes the load balance for N-body 
simulations of the LSS both in uniform and clustered situation. 

3 GROUPING 

Our work and data sharing-parallel tree (WDSH-PT) code is principally aimed at running LSS 
cosmological simulations with a number of particles as high as possible using supercomputers such 
as Cray T3E systems. In order to increase the code efficiency, we adopt initially the grouping 
method proposed by Barnes Q and introduce a modified implementation of his grouping policy 
yielding very high gains in the code performances with the same accuracy. 

3.1 B90 Grouping 

To compute the force on a body, the BH algorithm needs to build an interaction list (IL) for each 
particle p. Starting from the root cell, a tree inspection is done and the opening angle parameter 
6 is used to evaluate whether a cell must be opened or closed as mentioned above. If a cell has 
dimension C/ and distance d from the particle p so that Eq. (|]) is verified, the cell is closed, it is 
added to the IL, and its subcells are not investigated further. Otherwise the cell is opened and its 
subcells are investigated in the same way. Bodies belonging to an opened cell are added to the IL. 
Next, the force on the body is computed using the monopole and quadrupole momenta for all the 
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Figure 2: Interaction list average length for BH and B90 algorithms. 



cells in the list. 



BH timing The tree inspection phase represents a sizeable task to compute the force because 
the cell opening criterion is applied many times for each particle. The CPU time T Q to compute 
the force in a time-step for all the N particles is 

T = N(Ti)+N{T f ), (4) 

where (TJ) is the average time to build an IL and (Tf) is the average time to compute the force 
on each particle using the interaction list. 



B90 timing The basic idea of B90 was to build a unique interaction list that allows the force for 
a group of particles inside a region; i.e., a cell C gr0 up of the tree (grouping cell), to be computed 
reducing the number of tree inspections to build the ILs. B90 builds an IL that applies everywhere 
within Cgroup and reuses this IL for each particle p e C group in turn. In this way it is possible to 
reduce the tree inspection phase. The CPU time T g for B90 may be written as 

T g =N gc (T gl )+N{T gf ), (5) 

where 

• N gc is the number of grouping cells (assuming that each body is inside a group region); 

• (T g i) is the average time to build an interaction list for a group; 

• (T g f) is the average time to compute the force on a particle using the list formed for the 
group. 
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Figure 3: Measured (Tj) and (T gl ) using WDSHPT code in a Cray-T3E 1200 system. 



In the following paragraphs we will compare the T g time with the T Q time considering the 
generic case 9 = 0.8 . We notice that different values of 9 give similar results, as shown by the 
accompanying figures. 

B90 opening criterion The original BH algorithm adopts an opening criterion 9bh, based on 
the distance between the position of the p particle and the center of mass of the remote cells, the 
IL length being proportional to {0 BH ) 1 logN . In order to have the same accuracy as the 

original algorithm, the interaction list of the grouping cell is formed using Eq. (§) but now the 
d term is computed in terms of the distance from the center of mass of an inspected cell and the 
edge of the grouping cell, as shown in Fig. 1 (dsgo is used instead of dsn)- 

This implies that the IL formed using the grouping cell contains more elements than the IL 
formed by applying the original BH algorithm. 



B90 interaction list increment The B90 adopts an opening criterion (9b9o) based on the 
distance between the edge of C group and the center of mass of the remote cells. In this case the 
IL g length will be proportional to 9 B9Q logN . Moreover, the B90 criterion uses 9b90 numerically 
equal to 9bh when using the original BH algorithm. We found experimentally the relation between 
9b9o and 9bh, using 2 million particles in a uniform distribution (see Fig. 2): this relation agrees 
with data in Salmon ]2jj] . A typical value used as opening criterion to run simulations for the LSS 
is 9bh = 0.8. Consequently we consider 9sgo — 0.8 , which, in terms of IL length increment, 
corresponds to running a simulation with 9bh = 0.6. 



B90 timing vs BH timing Figs. 3 and 4 show the relationship (T gl ) - (T ( ) and (T gf ) - (T f ) 
with 9 ranging between 0.4 and 1.2. Considering (T g i) = 1.3(7]) and (T gf ) = 2.2(7)) with 9 = 0.8, 
Eq. (1q) may be rewritten using the above relations as follows: 
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Figure 4: Measured (T» and (T gf ) using WDSHPT code in a Cray-T3E 1200 system. 



T g = 1.3-—(T l ) + 2.2N(T f ), (6) 

where ^ r = -/V ffC , and (N gv ) is the average number of particles in a grouping cell. 
For a large number of systems and in particular for our WDSH-PT code running on the T3E 
system, (Tf) ranges from ~ 1.2(TJ) to ~ 1.5(T;) at 9 — 0.8. Considering these figures, we obtain 
from Eqs. (||) and (g), respectively, 

T = 2.2N(T t ) (7) 

and 

T = 2.2N(T l )( — + 1.2) (8) 
9 ^ uy 2.2(N gp ) ' V ' 

and then T g > T Q . 

A real gain could be obtained using B90 if the CPU time spent to form the interaction list were 
longer than the phase to compute the force on the particle. The results reported in Q demonstrate 
that if (Ti) S> (Tf) the code performance is between two and three times faster of the BH algorithm, 
and a good choice for the N cr n value is about 32 |fl2| . 

3.2 The Modified Grouping Strategy for LSS Simulations 

We will now describe the modification we introduce in the 1999 version of our WDSHPT code 
(WD99) to increase performance even if (TJ) < (Tf). The basic idea is to assign the same IL g to 
each particle within a cell C group , containing a maximum of N cr u particles. 

We will not use the B90 criterion to build the interaction list. Instead, we will use the same Obh 
criterion used in the original BH algorithm. This criterion is applied to a hypothetical particle 
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Figure 5: IL g formation using the Sphere criterion. 



placed in the center of mass of the C group , hereafter VB (Virtual Body) (Fig. 5). Moreover, we 
consider the IL g as formed by two parts given by 

IL g = ILf ar + IL near (9) 

ILf ar and IL near being two subsets of the interaction list. An element is included in one of the 
two subsets, using the following Sphere criterion for all the elements that satisfy Eq. (pi). 



Dpfinp inhere j- — o Cellslze{ - c a^v)^ 
j_/t:illlt: o puc-l ^radius — <J 2 



If Distance(IL g (element),VB) > Sphere ra di 

Add element to IL far- 
Else 

Add element to IL near 

Endif 



Moreover all p € C group are included in IL near . 

Using the two subsets it is possible to compute the force F p on a particle p 6 C group as the 
sum of two components, 

Fp Fy ar -f- F near , (10) 

where Ff ar is a force component due to the elements listed in ILf ar and F near is the force 
component due to the elements in IL near . We assume the component Ff ar to be the same for 
each particle p € C group and compute it considering the gravitational interaction between the VB 
and only the elements listed in ILf ar , while the F near component is computed separately for each 
p particle by the direct interaction with the elements listed in IL near . 
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Moreover, F near contains a restricted number of elements in comparison with the Ff ar list, so 
we expect a net gain in performance even if 7} < Tf. The gain that is possible depends on several 
parameters (N cr i t , the size of the C group and the Sphere radius), whose ranges of variation are 
constrained by the maximum allowed value of the overall error of the method, as we will describe 
in the following sections. 

3.3 Errors Analysis and Performance considerations 

Before showing the performance of our WD99 procedure in an N-body simulation of the large scale 
structure of the universe, it is important that we perform an error analysis of the procedure itself. 
Considering that the cumulative error, when simulations for the LSS studies are run using the 
original BH algorithm, is lower than 1% J(|, fixing the opening criterion 6 = 1, we will give some 
constraint concerning the size of C gr0 up , the N cr u value, and the Sphere radius needed to have 
negligible cumulative error. The following sub-sections discuss the two main sources of error. 

3.3.1 The Differences in the Interaction List 

The first error source is that WD99 uses the Sphere criterion and the VB to create an interaction 
list IL g = ILvb and WD99 applies the ILvb to all bodies p E C group . This approximation could 
introduce an error in the force value on the p particle if the IL p , created using the original BH 
algorithm, and the ILvb have a difference in the elements greater then 1%. As we found with our 
tests, in order to decrease this difference it is necessary to limit the size of C group that is equivalent 
to fixing a critical level of the tree structure: cells above the critical level cannot form a grouping 
cell. The user has to fix the critical level considering the density of local bodies in the box where 
bodies are arranged: the critical level must be chosen in such a way as to make the difference 
between IL p and ILvb negligible (no more than 1% of the elements). It seems reasonable for a 
LSS simulation in a 50 Mpc box with more than 2 million particles to fix the critical level as the 
sixth level of the tree. The next section shows in detail the obtained results. In any case, the 
cumulative error is very small considering the increase in accuracy compared with the original BH 
algorithm, due to the inclusion of all p G C group in the interaction list. 

3.3.2 Approximation of the Force Component 

The second error source is due to the assignment of F f ar , computed for the VB, to each;? £ C group . 
We found that the Sphere criterion allows us to reduce this error to values much lower than 0.01% 
for N > 10 6 if the dimension of the C group cell is fixed with the critical level as mentioned above, 
and the Sphere radius is three times the radius of the sphere enclosing the C group cell (Fig. 5). 
Another important constraint to be fixed is the value of N cr u. All the elements p € C group are 
listed in the IL near list and there is a direct body-body interaction among the N gp (N gp < N cr i t ) 
elements forming the group. This introduces a term 0(N gp N p ) in the algorithm complexity. In 
order to avoid a decrease of the code efficiency and to maintain a good code accuracy, as with 
the original BH algorithm, it seems reasonable, running LSS simulation with more than 1 million 
particles, to maintain N cr u < 32. We adopted, in our runs, a safe value N cr i t = 16 even if we 
obtained good results with N crit = 32. 

In the next section we show the errors obtained using the above-mentioned constraints when 
applying WD99 to LSS simulation with both uniform and clustered distributions. 

4 TESTS AND RESULTS 

We carried out many tests to estimate the error introduced in the WD99 and obtained increased 
performances using several values of N crit . Therefore, this section is subdivided as follows: first we 
test whether our algorithm increases the average length of the interaction list, then we measure the 
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Figure 6: Measurement of the average Interaction List length running the original 
BH code and our WD99 using critical level from 5 to 8. 



resulting percentage error, and we conclude with an overall performance analysis. As a test case 
we ran a simulation using 2 million particles for LSS in a cubic region of 50 Mpc, starting from a 
homogeneous initial condition (rcdshift Z = 50) and reaching a clustered configuration (redshift 
Z = 0). We used an opening angle parameter 9 ranging from 0.8 to 1.2. Our tests were executed 
on a Cray T3E system and the results will be shown in the following sections. 

4.1 Measuring the Interaction List Length 

The aim of this first test is to verify that the WD99 algorithm does not introduce a significant 
computational cost when the force for a generic particle is computed. This measurement is sub- 
stantially performed on the average length of the IL we form adopting our code. Fig. 6 reports 
the result we obtain when the simulation evolves at redshift Z = 50. 

Tests were executed for several values of redshift, but the differences between BH and our 
algorithm was computed only at the end of the run. The curves were obtained by fixing N cr u = 32 
and varying the critical level from 5 to 8. In all cases the differences we obtained are negligible, 
which means that the computed IL for the VB (with the adopted Sphere criterion) is about equal 
to the IL we obtain for a generic particle with the original BH algorithm. The first important result 
is that WD99 does not produce any increment in the IL length and consequently (Tj) = (T g i). 

4.2 Error Measurement 

We carry out this measurement in two phases. First we run a single time-step of the 2-million- 
particle simulation at rcdshift Z = 50. We compare the values we obtain running the BH original 
algorithm and the WD99. As a reference case, we adopt the critical level equal to 6. 

A similar comparison is made at Z — and the BH and WD99 histograms of the forces of each 
component are compared. The comparison shows a negligible difference in the force distribution in 
a single time-step, at least an order of magnitude less than the error of the original BH algorithm. 
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The second measurement is made analysing an entire system evolution. We start with the initial 
condition of 2 million particles with redshift Z = 50, At = 0.001, 9 = 0.8, and particle mass about 
1.655 ■ 10 10 solar masses. The system evolution is carried out up to redshift Z = 0. The evolution 
is executed with the original BH algorithm and with the WD99 code. As reference case, we adopt 
the critical level equal to 6. We measure the absolute error e in the position of particles and in the 
velocities of particles in the mean square sense, 



N 



^^{Xbh — XwD99) 2 + (Ybh — Y\VD99) 2 + {Zbh — Zw D99Y 



(11) 



and 



N 

/^2(Vxbh - Vx WDg9 ) 2 + (Vy B H - VywD99) 2 + (Vz B h - Vz WD99 ) 2 . (12) 
i=i 

The study of the final evolution is described in the next sub-section. Here we give the measured 
value at the end of the simulation: e pos = 0.003 and e ve i = 0.01. 

Similar values are measured when running simulations with more than 2 million particles and 
with a critical level equal to 6. The obtained e values lead us to conclude that the WD99 procedure 
does not introduce significant errors in comparison with the BH algorithm. 



1 

N 



\ 



4.2.1 Simulation Analysis 

The final stages obtained running simulations with the BH algorithm and the WD99 code are very 
similar. The two-point correlation function is defined as 

N q being the number of pairs of particles with separations between r and r + Ar, V the volume 
considered, N c the particle number taken as centres, and (p) the mean particle density. We 
calculate this function at redshift Z = for WD99 and BH algorithms. The values we obtain are 
perfectly equal, and the substructures we form (number and size) are identical. 



4.3 WD99 Performances 

To conclude our WD99 description, we report the performances measured for the WD99 code (Fig. 
7) (including the boundary periodic conditions using the Ewald method |l6|l ) and the performances 
of the original BH algorithm. The measured performances lead us to the conclusion that when 
the system evolution is clustered (Z = 0) the WD99 does not decrease the performance as the BH 
algorithm. This effect is due to the nature of the WD99 algorithm, which has a structure that 
increases the efficiency when clusters of particles are well closed. This important effect allows us 
to run simulations with very clustered systems, obtaining very good performance and negligible 
errors. Moreover, the efficiency of the WD99 increases by a factor of up to five at the redshift 
Z = 0. The gain is enhanced when bigger simulations are run: a recent simulation with 16 million 
particles performed on the Cray T3E system using WD99 showed an increase in performance by 
a factor of 7 at the redshift Z = 0. We note that the gain obtained, in comparison with that 
obtained by the original BH algorithm, is greater using a lower critical level (5 or 6). The gain is 
incremented using a Sphere criterion with Sphere radius lower than the value we consider, having 
only a small increment in the global error. 
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Figure 7: A 2 Million of particles at 8 = 0.8: BH algorithm and WD99 procedure 
performances. A critical level equal to 7 is fixed, and the measurement are performed 
for uniform (Z = 50) and clustered (Z — 0) system conditions. The y scale measure 
the number of particle/sec we compute. 



5 CONCLUSIONS AND FUTURE 

The code WD99 is mainly used for LSS studies, but it could be tested and used for other ap- 
plications where accuracy not higher than 1% is necessary. Considering the high performances 
we obtained, the WD99 method may be very successfully applied when clustered configurations 
such as galaxies or clusters of galaxies have to be studied. The new approach could be applied 
also to other fields of physics where collisionlcss systems arc to be simulated, as in plasma and 
hydrodynamic studies. 

The code is written in Fortran 90 with PGHPF/CRAFT, but the latest version (written in F90 and 
C languages) uses the one-side communication library SHMEM, allowing it to run on the ORIGIN 
2000 systems. A new version will be implemented using dynamical array allocation, and we are 
studying the implementation of the parallel out-of-core p5[ , moving data in the disk. This version 
will be developed for a CC-NUMA machine with MPI-2. We plan to have a freely available version 
of WD99 in October 2000. 
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